We propose and study a novel problem of mining news text and social media jointly to discover controversial points in news, which enables many applications such as highlighting controversial points in news articles for readers, revealing controversies in news and their trends over time, and quantifying the controversy of a news source. We design a controversy scoring function to discover the most...
As the ability to store and process massive amounts of user behavioral data increases, new approaches continue to arise for leveraging the wisdom of the crowds to gain insights that were previously very challenging to discover by text mining alone. For example, through collaborative filtering, we can learn previously hidden relationships between items based upon users' interactions with them, and...
Semantic knowledge is usually added to topic models to improve topic coherence. However, it is hard to judge whether semantic information is related to a topic without using complicated lexical characteristics. In this paper, we demonstrate a novel model called Cloud Transformation Model, which can easily judge whether semantic information is related to a topic, and integrate semantic information into...
This paper proposes a Contrarian Probabilistic Model (CPM) to evaluate the effectiveness of contrarians' investment in preferred stocks using big data from Tradeline. CPM accommodates the unique features of investment data which are often correlated, nested, heterogeneous, non-normal with missing values. The clustering and statistical inference are integrated in CPM, which enables joint investment...
Big data is a broad data set that has been used in many fields. Processing a huge data set is time-consuming, not only because of its sheer volume, but also because data types and structures can be different and complex. Currently, many data mining and machine learning techniques are being applied to deal with big data problems; some of them can construct a good learning algorithm in terms...
Understanding bike trip patterns in a bike sharing system is important for researchers designing models for station placement and bike scheduling. By bike trip patterns, we refer to the large number of bike trips observed between two stations. However, due to privacy and operational concerns, bike trip data are usually not made publicly available. In this paper, instead of relying on time-consuming...
A mechanism for identifying bandings in large "zero-one" N-dimensional data sets, using a sampling technique, is presented. The challenge of identifying bandings in data is the large number of potential permutations that need to be considered. To circumvent this, a banding score mechanism is proposed that avoids the need to consider large numbers of permutations. This has been incorporated...
Preference (top-k) queries play a key role in modern data analytics tasks. Top-k techniques rely on ranking functions to determine an overall score for each object across all the relevant attributes being examined. This ranking function is provided by the user at query time, or generated for a particular user by a personalized search engine, which prevents the pre-computation of the...
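To make the role of the ranking function concrete, here is a minimal sketch of a top-k query evaluated at query time. It is not the method of the paper above, whose details are truncated in this listing. The names `top_k` and `linear_ranker`, the weighted-sum scoring, and the `hotels` data are all assumptions chosen for illustration.

```python
import heapq
from typing import Callable, Dict, List

Item = Dict[str, float]


def linear_ranker(weights: Dict[str, float]) -> Callable[[Item], float]:
    """Build a ranking function that scores an item as a weighted sum of attributes.

    A weighted sum is only one possible ranking function (an assumption here).
    """
    def score(item: Item) -> float:
        return sum(w * item[attr] for attr, w in weights.items())
    return score


def top_k(items: List[Item], k: int, score: Callable[[Item], float]) -> List[Item]:
    """Return the k highest-scoring items, best first.

    The scan happens at query time because the ranking function is only
    known once the user supplies it.
    """
    return heapq.nlargest(k, items, key=score)


# Hypothetical data: higher "rating" and higher "cheapness" are both better.
hotels = [
    {"id": 1, "rating": 4.5, "cheapness": 0.2},
    {"id": 2, "rating": 3.9, "cheapness": 0.9},
    {"id": 3, "rating": 4.8, "cheapness": 0.1},
    {"id": 4, "rating": 3.0, "cheapness": 1.0},
]

quality_first = top_k(hotels, 2, linear_ranker({"rating": 1.0, "cheapness": 0.0}))
price_first = top_k(hotels, 2, linear_ranker({"rating": 0.2, "cheapness": 1.0}))
```

The two queries return different items from the same data, which is why a result ranked for one user's weights cannot simply be reused for another's.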
The usage of large amounts of data has an immense potential for global economic growth and the competitiveness of countries with high technological standards. Vast amounts of data from different sources are collected and analyzed in order to seek economic profit and competitive advantages for companies and society in general. To gain profit from such data, it needs to be analyzed, processed, and interpreted...
This paper discusses a project that studied the relationship between citizen trust and social protest using visual analysis of approximately 11 million sentiment classified Tweets from the period of the 2014 Brazilian World Cup. The results of the study reveal that the 2014 World Cup protests in Brazil sprang from a wide range of grievances coupled with a relative sense of deprivation compared with...
A system is developed to redact personally identifiable information (PII) from millions of medical transcriptions with very high precision, through a combination of entity recognition, regular expressions, and machine learning. This system is trained and tested with manually redacted medical transcriptions using an internally developed coding system, providing double blind classification capabilities.
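As a rough illustration of the regular-expression component only, the sketch below replaces a few simple PII-like patterns with placeholder labels. It does not reproduce the system described above: the patterns, the labels, and the `redact` function are hypothetical, and the entity recognition and machine learning stages are omitted.

```python
import re

# Illustrative patterns only (assumptions); real PII formats vary far more widely.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact(text: str) -> str:
    """Replace every match of each pattern with a bracketed label."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Patient seen 03/14/2015, SSN 123-45-6789, call 555-867-5309."
redacted = redact(sample)
```

Fixed patterns like these catch only rigidly formatted identifiers, so names and free-text details would need the other components mentioned in the abstract.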
Achieving high-quality clustering is one of the most well-known problems in data mining. k-means is by far the most commonly used clustering algorithm. It converges fairly quickly, but achieving a good solution is not guaranteed. The clustering quality is highly dependent on the selection of the initial centroids. Moreover, when the number of clusters increases, it starts to suffer from...
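The dependence on initial centroids can be seen in a small sketch of standard k-means (Lloyd's algorithm) on one-dimensional data. This is not the improved method of the paper above; the function `kmeans_1d`, the data, and the two starting configurations are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def kmeans_1d(points: Sequence[float], init: Sequence[float],
              max_iter: int = 100) -> Tuple[List[float], float]:
    """Run Lloyd's algorithm on 1-D points from the given initial centroids.

    Returns the final centroids and the sum of squared errors (SSE).
    """
    centroids = list(init)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters: List[List[float]] = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: (p - centroids[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: move each centroid to its cluster mean
        # (an empty cluster keeps its previous centroid).
        updated = [sum(c) / len(c) if c else centroids[i]
                   for i, c in enumerate(clusters)]
        if updated == centroids:
            break  # converged: no centroid moved
        centroids = updated
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse


# Three well-separated pairs of points (hypothetical data).
data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

good_centroids, good_sse = kmeans_1d(data, init=[0.0, 10.0, 20.0])
bad_centroids, bad_sse = kmeans_1d(data, init=[0.0, 1.0, 10.0])
```

Both runs converge within two iterations, but the second start places two centroids in the same pair of points and stays stuck in a much worse local optimum.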
In this paper, we review several methods from Statistical Learning Theory (SLT) for the performance assessment and uncertainty quantification of predictive models. Computational issues are addressed so as to allow scaling to large datasets and the application of SLT to Big Data analytics. The effectiveness of the application of SLT to manufacturing systems is exemplified by targeting the derivation...
The state-of-the-art scheduler of containerized cloud services considers load balance as the only criterion and neglects many others, such as application performance. In the era of Big Data, however, applications have evolved to be highly data-intensive and thus perform poorly in existing systems. This particularly holds for Platform-as-a-Service environments that encourage an application model of stateless...
Over the past years, frameworks such as MapReduce and Spark have been introduced to ease the task of developing big data programs and applications. However, the jobs in these frameworks are roughly defined and packaged as executable jars without any functionality being exposed or described. This means that deployed jobs are not natively composable and reusable for subsequent development. Besides,...
Today big data is synonymous with every business and organization, so much so that data brokers have made a business of trading this big data like any other commodity. In turn, the buyers of this big data make massive profits. The only one who loses out on both profits and privacy is the internet user, the generator and owner of this big data. Our work looks at allowing the user to monetize on his...
In this paper, we address the problem of data confidentiality in big data analytics. In many fields, many useful patterns can be extracted by applying machine learning techniques to big data. However, data confidentiality must be protected. In many scenarios, data confidentiality could well be a prerequisite for data to be shared. We present a scheme to provide provably secure data confidentiality...
This paper studies one-scan approximation algorithms for streaming data mining (SDM). Despite the importance of pattern discovery in streaming data, this issue has not yet been sufficiently addressed in the big data community. In this context, we briefly review the previously proposed SDM methods. A recent work aims to overcome their limitations using the technique of online compression. It is based...
Large amounts of data are being generated every day, creating new challenges and opportunities that lead to extraordinary new knowledge and discoveries in many application domains, ranging from science and engineering to business. One of the main challenges in this era of Big Data is how to efficiently manage and analyse data at such a scale. This is challenging not only due to the size of the data,...
Rumor detection in streaming social media is a significant but challenging problem. In this paper, we present a method to identify rumor patterns in the streaming social media environment. Patterns that combine both structural and behavioral properties of rumors are first proposed to distinguish false rumors from valid news. A novel graph-based pattern matching algorithm is also described to detect...